Add AGENTS.md and enrich package docstring#1497

Open
timsaucer wants to merge 7 commits into apache:main from timsaucer:feat/create-user-agent-file

Conversation

@timsaucer
Member

@timsaucer timsaucer commented Apr 15, 2026

Which issue does this PR close?

Addresses part of #1394 (PR 1a from the implementation plan)

Rationale for this change

AI agents (and humans) that encounter datafusion via pip install currently get a 2-line module docstring and no structured guide to the DataFrame API. This makes it difficult for agents to produce idiomatic DataFrame code, even though they are very capable with SQL. The goal is that any agent -- whether it encounters the package via pip, the docs site, or the repo -- gets enough context to write correct DataFrame code.

What changes are included in this PR?

  1. python/datafusion/AGENTS.md (new) -- comprehensive DataFrame API guide that ships with pip install datafusion (Maturin includes all files under python-source = "python"). Covers:

    • What DataFusion is and core abstractions
    • Import conventions and data loading
    • All DataFrame operations with examples (select, filter, join, aggregate, window, sort, limit, set operations, deduplication)
    • Executing and collecting results
    • Expression building (arithmetic, comparisons, boolean logic, null handling, CASE/WHEN, casting, aliasing, BETWEEN, IN)
    • SQL-to-DataFrame reference table (~25 mappings)
    • Common pitfalls (boolean operators, lit() wrapping, column quoting, immutable DataFrames, window frame defaults, HAVING pattern)
    • Idiomatic patterns (fluent chaining, variables as CTEs, window functions for scalar subqueries, semi/anti joins for EXISTS/NOT EXISTS)
    • Categorized function index
  2. python/datafusion/__init__.py (modified) -- enriched module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md.

  3. AGENTS.md (modified, root) -- clarified that the root file is for contributors working on the project, and added a pointer to python/datafusion/AGENTS.md for agents that need to use the DataFrame API.
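The "boolean operators" pitfall listed above can be illustrated without DataFusion at all. The sketch below is not taken from AGENTS.md; `ToyExpr` is a hypothetical stand-in class (not datafusion's expression type), used only to show why Python expression builders generally combine conditions with `&` instead of `and`.

```python
class ToyExpr:
    """Toy stand-in for an expression object; NOT datafusion's Expr class."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __gt__(self, other: object) -> "ToyExpr":
        # Build a new expression instead of returning a bool.
        return ToyExpr(f"({self.text} > {other})")

    def __and__(self, other: "ToyExpr") -> "ToyExpr":
        # `&` is overloadable, so both operands can be captured.
        return ToyExpr(f"({self.text} AND {other.text})")

    def __repr__(self) -> str:
        return self.text


a, b = ToyExpr("a"), ToyExpr("b")

# Python's `and` cannot be overloaded: it returns one operand unchanged,
# so the first condition silently disappears here.
wrong = (a > 1) and (b > 2)

# `&` dispatches to __and__, so both conditions are kept. The parentheses
# are required because `&` binds more tightly than `>`.
right = (a > 1) & (b > 2)

print(wrong)  # (b > 2)
print(right)  # ((a > 1) AND (b > 2))
```

The same reasoning applies to `or` versus `|`. How the real library reacts to `and` is not stated in this PR, so treat the toy's silent behaviour as an illustration of the general Python rule only.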

Are there any user-facing changes?

Yes -- the datafusion package now ships with an AGENTS.md guide and has a richer module docstring visible via help(datafusion). No API changes.
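As a rough illustration of how a tool or agent might discover both artifacts after installation, here is a minimal sketch using only the standard library. The helper names `find_packaged_file` and `module_docstring` are hypothetical, and the sketch assumes the guide sits next to the package's `__init__.py`, as the description above says.

```python
import importlib
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def find_packaged_file(package: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` inside an installed package, or None."""
    try:
        spec = find_spec(package)  # does not import a top-level package
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


def module_docstring(package: str) -> Optional[str]:
    """Return the module docstring (the text help() shows first), or None."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return None
    return module.__doc__


if __name__ == "__main__":
    guide = find_packaged_file("datafusion", "AGENTS.md")
    print(guide if guide else "AGENTS.md not found (package missing or older release)")
    doc = module_docstring("datafusion")
    print(doc.splitlines()[0] if doc else "no module docstring available")
```

Both helpers return `None` instead of raising when the package is absent, so the check can run in environments where datafusion is not installed.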

timsaucer and others added 2 commits April 15, 2026 09:51
Add python/datafusion/AGENTS.md as a comprehensive DataFrame API guide
for AI agents and users. It ships with pip automatically (Maturin includes
everything under python-source = "python"). Covers core abstractions,
import conventions, data loading, all DataFrame operations, expression
building, a SQL-to-DataFrame reference table, common pitfalls, idiomatic
patterns, and a categorized function index.

Enrich the __init__.py module docstring from 2 lines to a full overview
with core abstractions, a quick-start example, and a pointer to AGENTS.md.

Closes apache#1394 (PR 1a)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The root AGENTS.md (symlinked as CLAUDE.md) is for contributors working
on the project. Add a pointer to python/datafusion/AGENTS.md which is
the user-facing DataFrame API guide shipped with the package. Also add
the Apache license header to the package AGENTS.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer marked this pull request as draft April 15, 2026 13:54
timsaucer and others added 4 commits April 15, 2026 10:02
Document that all PRs must follow .github/pull_request_template.md and
that pre-commit hooks must pass before committing. List all configured
hooks (actionlint, ruff, ruff-format, cargo fmt, cargo clippy, codespell,
uv-lock) and the command to run them manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Let the hooks be discoverable from .pre-commit-config.yaml rather than
maintaining a separate list that can drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Clarify that DataFusion works with any Arrow C Data Interface
  implementation, not just PyArrow.
- Show the filter keyword argument on aggregate functions (the idiomatic
  HAVING equivalent) instead of the post-aggregate .filter() pattern.
- Update the SQL reference table to show FILTER (WHERE ...) syntax.
- Remove the now-incorrect "Aggregate then filter for HAVING" pitfall.
- Add .collect() to the fluent chaining example so the result is clearly
  materialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer
Member Author

Positive update: after my latest push (4429a08), the agent now correctly creates an idiomatic datafusion-python file for the first TPC-H query using only the text description from the specification, having been directed strictly not to use the SQL as a reference. I didn't feed it the SQL, and I gave it those instructions so that it wouldn't find the answer during its searching. When I get more time, I plan to work through each of the queries until we have an agent file that can reproduce all of TPC-H with idiomatic code.

@timsaucer
Member Author

FYI @ntjohnson1 you might get some value out of grabbing the python/datafusion/AGENTS.md file but this is still a work in progress.

@ntjohnson1
Contributor

> FYI @ntjohnson1 you might get some value out of grabbing the python/datafusion/AGENTS.md file but this is still a work in progress.

Thanks for the heads up. @iblnkn is going to do some query work in the short term, so it would be good to try this out alongside some of the internal AGENTS.md material we have.

@timsaucer
Member Author

With my latest push, I have a folder that contains only the text descriptions of the TPC-H queries, and I gave the agent this guidance:

Review the @README.md and @AGENTS.md in this directory. Each of the problem statements is listed in @problems/ . I want you to generate solutions for each problem statement. However when you do this you are forbidden from making any changes to your solution after your first evaluation. This is an attempt to test that our agents file contains all of the necessary instructions, so you should be able to get each one right on the first attempt.

The contents of README.md were:

DataFusion Python - TPC-H Queries

Overview

This project implements TPC-H benchmark queries using idiomatic datafusion-python code. The goal is to translate natural language problem descriptions into DataFrame API queries, not to transliterate SQL into Python.

Data

TPC-H parquet files are located in the data/ directory:

  • customer.parquet
  • lineitem.parquet
  • nation.parquet
  • orders.parquet
  • part.parquet
  • partsupp.parquet
  • region.parquet
  • supplier.parquet

Approach

Each query should be written as idiomatic datafusion-python, using the DataFrame API with fluent chaining, col()/lit() expressions, and functions from the functions module. Solutions should keep data in Arrow-native formats and avoid unnecessary conversions to Python types.

Allowed Sources

  • AGENTS.md — local copy of the datafusion-python DataFrame API guide
  • datafusion-python documentation at https://datafusion.apache.org/python/
  • Problem descriptions in the problems/ directory

Restrictions

  • Do not use or analyze any TPC-H SQL queries. Solutions must be derived from the natural language problem descriptions alone, not by translating SQL.
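The data layout that the README describes can be checked before any query runs. The following is a minimal sketch, not part of the PR: the helper name `missing_tables` is hypothetical, and it assumes the eight files live directly under `data/` with exactly the names listed in the README.

```python
from pathlib import Path
from typing import List, Union

# Table names taken from the README's file list.
TPCH_TABLES = (
    "customer",
    "lineitem",
    "nation",
    "orders",
    "part",
    "partsupp",
    "region",
    "supplier",
)


def missing_tables(data_dir: Union[str, Path] = "data") -> List[str]:
    """Return the TPC-H table names whose parquet file is absent from data_dir."""
    root = Path(data_dir)
    return [name for name in TPCH_TABLES if not (root / f"{name}.parquet").is_file()]


if __name__ == "__main__":
    missing = missing_tables()
    if missing:
        print("Missing parquet files for:", ", ".join(missing))
    else:
        print("All eight TPC-H tables are present.")
```

Running a check like this first keeps a missing-file error from being mistaken for a gap in the guide when a generated query fails.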

Additionally I have a CLAUDE.md file with:

Do not store auto-memory for this folder. The user is developing and testing skills here, and cross-session memory may bias how skills get written or evaluated between runs. Do not write to ~/.claude/projects/-Users-tsaucer-working-agentic-dfpython/memory/ — no feedback, user, project, or reference memories.

Do not read prior query solutions under solutions/ when writing a new query. Each query must be derived only from AGENTS.md (and the resources it points to) plus the problem description in problems/. The goal is to build up AGENTS.md as the sole durable guide; cross-referencing other solutions biases new queries toward patterns that may or may not be captured in the guide, and hides gaps we want to surface. This applies even for "style matching" — if a style convention matters, it belongs in AGENTS.md, not inferred from siblings.

Whenever you hit a problem while generating a query — a DataFusion error, a surprising planner rejection, a type mismatch, an API quirk not covered by the existing guide — after resolving it, propose a concrete addition or edit to AGENTS.md so a future agent does not repeat the mistake. Phrase the proposal as a short recommendation (the rule, a minimal wrong/right example, and where it should live in the file) and wait for user approval before editing AGENTS.md. Since memory is disabled for this folder, AGENTS.md is the only durable channel for these lessons.

Results

Using this setup, the agent created all 22 TPC-H queries. I then validated that they all run at scale factor 1 and produce the expected results. I also checked each file to make sure the code is idiomatic.

@timsaucer timsaucer marked this pull request as ready for review April 17, 2026 18:27